arXiv:1505.04873v1 [cs.CV] 19 May 2015


Have a Look at What I See 


Lior Talker Yael Moses Ilan Shimshoni 

The University of Haifa, Israel The Interdisciplinary Center, Israel The University of Haifa, Israel 
ItalkeOl @campus.haifa.ac.il yael@idc.ac.il ishimshoni@mis.haifa.ac.il 

talker.lior@idc.ac.il 



[Figure: camera guidance example. Panels, left to right: Initial; Initial + interface; First iter. + interface; Final; Destination.]


Abstract 

We propose a method for guiding a photographer to rotate her/his smartphone camera to obtain an image that overlaps with another image of the same scene. The other image is taken by another photographer from a different viewpoint. Our method is applicable even when the images do not have overlapping fields of view. Straightforward applications of our method include sharing attention to regions of interest for social purposes, or adding missing images to improve structure from motion results. Our solution uses additional images of the scene, which are often available since many people use their smartphone cameras regularly. These images may be available online from other photographers who are present at the scene. Our method avoids 3D scene reconstruction; it relies instead on a new representation that consists of the spatial orders of the scene points on two axes, x and y. This representation allows a sequence of points to be chosen efficiently and projected onto the photographer’s images, using epipolar point transfer. Overlaying the resulting epipolar lines on the live preview of the camera produces a convenient interface to guide the user. The method was tested on challenging datasets of images and succeeded in guiding a photographer from one view to a non-overlapping destination view.

1. Introduction 

Assume Alice and Bob capture two images of non-overlapping sections of the same scene. Alice then says to Bob, “Have a look at what I see.” Can Bob rotate his camera to view the section of the scene viewed by Alice? In this paper we refer to the problem of computing this desired rotation and conveying it to the photographer as the camera guidance problem.

The objective in the camera guidance problem is to guide Bob’s camera to rotate and capture a new image that significantly overlaps with Alice’s image (i.e., the images share pixels that correspond to the same scene points). To simplify the objective, we aim to determine the rotation of Bob’s camera such that a scene point, $P_t$, projected to the center of Alice’s image (the destination image) will also be projected to Bob’s newly captured image.

We propose a method to compute the projection of the scene point, $P_t$, to the image initially captured by Bob (the initial image). We then show how this point can be used to compute the desired rotation and convey it to the user via a simple interface. Our solution makes use of additional images of the scene that are assumed to be available from other photographers who are present at the scene. Such additional images are necessary to solve the problem when the destination and initial images do not overlap.

Before we describe our method, let us first consider two existing approaches that are natural candidates for computing the projection of $P_t$ to the initial image’s plane: Structure from Motion (SFM) methods [23, 15, 26, 10, 20] and image-based panoramas [8, 23]. As we show in Section 3 and demonstrate in Section 5, neither is effective in the scenarios we consider. The former is time-consuming and requires a large number of images, and the latter is not applicable to general 3D scenes such as the ones we consider here.

We propose instead to efficiently choose a small subset of images and use epipolar constraints to compute the projection of $P_t$ to the initial image plane. When the gap between the initial and the destination images is large, the goal is achieved through a sequence of intermediate views. Each such view corresponds to a rotation of the camera such that the projection of a chosen scene point is at the center region of the newly captured image.

To efficiently choose the subset of images, we propose a novel rough representation of the scene. The representation consists of the spatial orders of the scene points on two axes, x and y. We call this representation Spatially Ordered Feature Aggregation (SOFA). Computing SOFA is trivial if reliable feature correspondence is available and the spatial orders of the scene points are preserved in their projections to the images. In this case, the axes of a single image can be used to represent the order of the features, as all partial orders are consistent. However, all existing feature matching methods are imperfect, and in practice, for most scenes, the order-preserving assumption does not hold. Hence we propose to combine the orderings of the features in the different views to obtain approximate global orderings. To achieve this goal in a robust manner we use rank aggregation [13]. Rank aggregation methods were employed for combining the ranks of web pages obtained by several search engines and were previously used in computer vision for ranking the temporal order of images [12].

Finally, we need to convey to the photographer how to rotate the camera towards the goal. When the camera is automatically controlled (e.g., a camera mounted on a robot or a PTZ camera), the computed rotation can be used directly to guide it. However, here we consider hand-held cameras, where the user is incapable of following such instructions. This is true regardless of the manner in which the rotation is computed (SFM, panoramic view, or our method). We propose a user interface superimposed on the live preview of the camera, visually indicating the required rotation.

Applications: People often like to share images with others. Indeed, this is one of the main appeals of Facebook, Instagram, etc. Our method allows a photographer to direct the attention of others present at the same scene to an interesting region, rather than sending an image of that region. This may be more rewarding since the observer can capture his own image of the region, or follow events that take place in that region. It is applicable without any verbal or physical communication, for people who may be at different positions in the scene. The additional images of the scene can be captured by a crowd present at the scene or downloaded from available public photo collections. In many scenarios, however, a public photo collection is not available; hence a short preprocessing time is essential. A similar application is for tourism with a virtual guide. Our method can be used to direct the tourist to rotate his camera to view the section of the scene to which the guide refers.

Our method can also be used in a homing application that helps people find their friends in a crowd (when GPS or WiFi based localizations are unavailable). Homing algorithms have already been developed for robotics applications, but overlapping fields of view are required (e.g., [4, 16]). Our method can be used to obtain such views.

A novel setup that we envision is “collaborative photography”, where a group of photographers attending the same event cooperate to solve a given task. For example, to compute a high quality 3D structure of the scene, many overlapping images from different poses are desired. Our method can be used to obtain additional images on request (see the experiment in Sec. 5).

The main contributions of the paper are (i) the introduction of a new challenging task, the camera guidance problem, and its efficient solution; (ii) the novel SOFA scene representation that allows the camera guidance problem to be solved efficiently while avoiding 3D scene reconstruction; (iii) a novel visual user interface for smartphone cameras to guide the user to rotate his camera towards a given scene point.

2. Additional Related Work 

To allow for geometric and spatial reasoning without direct 3D reconstruction, we use the spatial orderings of the scene points, obtained by rank aggregation. Rank aggregation is the problem of finding a full ranking that agrees with multiple (full or partial) rankings, i.e., a consensus of rankings. It was traditionally studied in the context of social choice and voting theory [27], but was also used for biological sequence alignment [7] and web page ranking [13]. Recently, it was used to temporally order a collection of images of a dynamic scene [12, 11]. The common rank aggregation problem is known to be NP-hard [13]. In our method, we use the Markov chain approximation for rank aggregation, which was proven to be quite effective if the power iteration method is used [13].

The camera guidance problem seemingly resides in the field of “active vision” [9], where the goal is to change the pose of controlled cameras, e.g., cameras mounted on robots or PTZ cameras, to allow the optimization of some objective. For example, when the objective is object tracking, fixation on objects over time is maintained through control of the camera pose. A general approach for the simultaneous tracking of multiple moving targets using a generic active stereo setup is studied by Barreto et al. [3]. As far as we know, none of the methods in the active vision field considered a set of casually taken photographs as in our scenario; moreover, the environment is usually strictly controlled in advance and then manipulated [6]. Another drawback of these methods is the computational time, where full 3D reconstruction of the scene is usually given in advance or exhaustively calculated. Our method may also be applied to robot collaboration.

3. Method 

The input to our method is a pair of images: an initial image, $I^0$, captured by camera $C$, and a destination image, $I_d$. In addition, a set of images of the scene, $\mathcal{I} = \{I_j\}$, possibly captured at (roughly) the same time, is available. The objective is to determine the rotation of $C$ such that a scene point $P_t$ projected to the center region of $I_d$ will also be projected to the center region of a new image captured by $C$, $I^m$. To do so, it is sufficient to compute $p_t^0$, the projection of $P_t$ to the image plane of $I^0$ (not necessarily in the current FOV of $C$). A visual user interface is then used to assist the photographer in rotating $C$ such that $P_t$ is projected to the center region of $I^m$ (see Sec. 3.4).

Before we describe our method, we discuss two alternative methods and their limitations. A direct way to compute $p_t^0$ is by first recovering the projection matrices and the set of scene points, $\mathcal{P}$, using structure from motion (SFM) methods (e.g., [22]). This solution is not applicable for online computation due to the long running time required, as we demonstrate in Sec. 5. Another alternative is to map all images to $I^0$ using homography transformations. In this case, the location of $p_t^0$ may be obtained by $\tilde{p}_t^0 = H \tilde{p}_t^d$ ($\tilde{p}$ denotes the homogeneous coordinates of $p$), where $p_t^d$ is the projection of $P_t$ to $I_d$ and $H$ is the homography between $I^0$ and $I_d$. However, when the scene contains non-negligible 3D structure with respect to the distance from the camera (i.e., it is not planar) or the cameras are separated by considerable translation, a homography transformation does not exist, as discussed in Sec. 5.

Instead, we propose an efficient method to compute $p_t^0$ that avoids 3D reconstruction and whose preprocessing time is on the order of minutes rather than hours as in SFM. We first consider the basic case where $I^m$ can be captured in a single step. When the gap between $I^0$ and $I_d$ is large, intermediate images have to be captured in order to reach $I^m$ (see Sec. 3.2). Then we describe the user interface for guiding the camera rotation.

3.1. Basic Case: Single Step 

We begin with the case that $p_t^0$ is computed in a single step. We propose to use a supporting subset of images, $\mathcal{I}^s \subset \mathcal{I}$, that satisfies the following constraints: (i) each of the images $I_j \in \mathcal{I}^s$ has sufficient overlap with $I^0$, that is, there are enough corresponding features to compute the fundamental matrix, $F_{0j}$, between them; (ii) the projection of $P_t$, denoted $p_t^j$, is detected in each of the images $I_j \in \mathcal{I}^s$. Given $\mathcal{I}^s$, the epipolar line that corresponds to $p_t^j$ in image $I^0$ is given by $l_j = F_{0j} \tilde{p}_t^j$. The epipolar point transfer (EPT) [18] is used to compute $p_t^0$, that is, the intersection of a pair of epipolar lines, $l_j$ and $l_k$. Note that $p_t^0$ is not necessarily within the FOV of $I^0$. For robustness, when $|\mathcal{I}^s| > 3$, the intersection point with the most epipolar line inliers is found, and the outliers are discarded. When the number of epipolar lines is large, the RANSAC algorithm [14] is used. In our method we typically restrict $|\mathcal{I}^s| < 5$.
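The single-step transfer can be illustrated with a minimal sketch. It assumes the fundamental matrices are already available as 3x3 nested lists and that lines are homogeneous triples; the function names, the pixel tolerance `tol`, and the exhaustive pair search (instead of RANSAC) are illustrative choices, not the authors' implementation.

```python
from itertools import combinations


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def epipolar_line(F, p):
    """Epipolar line l = F p~ for a 3x3 matrix F and a pixel p = (x, y)."""
    ph = (p[0], p[1], 1.0)
    return tuple(sum(F[r][c] * ph[c] for c in range(3)) for r in range(3))


def intersect(l1, l2):
    """Intersection of two homogeneous lines, or None if they are parallel."""
    x, y, w = cross(l1, l2)
    if abs(w) < 1e-12:
        return None
    return (x / w, y / w)


def point_line_distance(p, l):
    """Euclidean distance from pixel p to the line a*x + b*y + c = 0."""
    a, b, c = l
    return abs(a * p[0] + b * p[1] + c) / (a * a + b * b) ** 0.5


def transfer_point(lines, tol=3.0):
    """Intersect every pair of lines and keep the point supported by most lines."""
    best, best_inliers = None, -1
    for l1, l2 in combinations(lines, 2):
        p = intersect(l1, l2)
        if p is None:
            continue
        inliers = sum(point_line_distance(p, l) <= tol for l in lines)
        if inliers > best_inliers:
            best, best_inliers = p, inliers
    return best, best_inliers
```

With at most four or five supporting images the exhaustive pair search is cheap; the returned point may lie outside the image bounds, which matches the remark that the transferred point need not be within the FOV.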

3.2. General Case: Multiple Steps 

When the gap between $I^0$ and $I_d$ is large, the supporting set of images, $\mathcal{I}^s \subset \mathcal{I}$, does not exist and the EPT cannot be used directly to compute $p_t^0$. Hence, we suggest rotating $C$ to a sequence of intermediate views, $I^1, \dots, I^m$, until reaching the desired overlap with $I_d$. A sequence of points $p_1, \dots, p_m$ is chosen such that $C$ is rotated to center $p_k$ in image $I^k$; that is, $p_k$ is at the center region of $I^k$. If $p_m = p_t$ then $I^m$ is the desired final image.

To this end, we define spatial orders between scene points. We next describe how such orders can be used to compute the sequence, $p_1, \dots, p_m$, and then show how the orders can be efficiently computed (the SOFA representation).

3.2.1 Sequence Properties 

Let us first assume that the order of the scene points is given by the order of the x coordinate of their projections to the image plane of $I^0$ (not necessarily within the FOV of $I^0$). Since this is a non-trivial assumption, we show in Sec. 3.3 how this order can be (approximately) computed. Let $p_c^0$ be a point at the center region of $I^0$. The relative rank of $p_t^0$ and $p_c^0$ in the ordered sequence determines whether to rotate $C$ “to the left” or “to the right”. We next show how this is used to determine the first intermediate image, $I^1$. This is an iterative process that is repeated until the final image, $I^m$, is reached.

Formally, let $S_x$ be the sequence of features ordered by their x coordinate. The spatial order is defined by the feature permutation $\sigma_x$. That is, $\sigma_x(p)$ is the rank of the feature point $p$ in $S_x$. The feature ranked $i$-th in $S_x$ is given by $S_x(i)$. Let $S_x(\gamma) = p_t$ and $S_x(\alpha) = p_c$. We use the order of points in $S_x$ to choose a new point, $S_x(\beta)$, to be centered in the next image. Assume without loss of generality that $\alpha < \gamma$, that is, $S_x(\alpha)$ precedes $S_x(\gamma)$ in $S_x$. We choose the new point $S_x(\beta)$ such that (i) $\alpha < \beta \le \gamma$ and $\gamma - \beta$ is minimal; (ii) a supporting set of images for computing $S_x(\beta)$ exists (see Sec. 3.1). Then $C$ is rotated to capture a new image so that $S_x(\beta)$ is at its center region (see Sec. 3.4). This procedure is repeated until $S_x(\beta) = p_t$ is centered in $I^m$.

In a similar way, let $S_y$ be the sequence of features ordered by their y coordinate and defined by the feature permutation, $\sigma_y$. An additional constraint is added by $\sigma_y$ for choosing a new point, $p$, for centering in the next intermediate image. The feature ranked $i$-th in $S_y$ is given by $S_y(i)$. Each point $p$ appears in both sequences, $S_x$ and $S_y$, but may have a different ranking. Let $\gamma'$ be the ranking of $p_t$ (i.e., $S_y(\gamma') = p_t$), and $\alpha'$ be the ranking of $p_c$ (i.e., $S_y(\alpha') = p_c$). Assume without loss of generality that $\alpha' < \gamma'$. We choose a new point, $p = S_y(\beta') = S_x(\beta)$, such that (i) $\alpha < \beta \le \gamma$, $\alpha' < \beta' \le \gamma'$ and $(\gamma - \beta) + (\gamma' - \beta')$ is minimal; (ii) a supporting set of images for computing $p$ exists.

3.3. SOFA Computation 

The 3D location of the scene points can be used to compute their projections to $I^0$, and hence their orders, $\sigma_x$ and $\sigma_y$. Since we are trying to avoid 3D reconstruction, each 3D point is represented by a set of matched features in the set of images, $\mathcal{I}$ (see Sec. 3.3.1). Consider the ideal case where (i) a perfect matching between image features is available in all images; (ii) the order of corresponding points is preserved in all images; and (iii) there exists sufficient overlap between images. Note that if these three conditions hold, the spatial orders are identical in all images, and define the spatial orders of the corresponding 3D scene points. In this case $\sigma_x$ and $\sigma_y$ are obtained by simply combining the partial orders from all images.

However, in practice there are matching errors and the order of corresponding points is not preserved in all images. To overcome this problem, we define the spatial orders on the 3D scene points, $\sigma_x$ and $\sigma_y$, that are as consistent as possible with respect to the spatial orderings of their visible projections to the set of images. Ranking the order of the scene points from a noisy set of rankings is cast as the well-known rank aggregation problem. In our case, each image provides a ranking of the 3D scene points according to the spatial locations of their projections to the image. We next describe an efficient approximate solution for computing the correspondence between features in a large set of images and an approximate solution to the rank aggregation problem.


3.3.1 Feature Correspondence via a Visual Dictionary 

An approximate feature correspondence can be obtained by using a dictionary of visual words (e.g., [21]), where each bin, $B_i$, represents the projections of the same scene point, $P_i \in \mathcal{P}$. In the rest of this section, we regard $B_i$ as the representative of the projection of a scene point $P_i$. A straightforward (but costly) alternative is to compute a pairwise matching between each pair of images (e.g., [19]). Using a dictionary allows us to overcome the time complexity involved in pairwise matching (at least $O(n^2)$). Its robustness is sufficient for our method, as demonstrated experimentally (see Sec. 5). In our implementation we use SIFT features [19] that are clustered using hierarchical K-means [24].


3.3.2 Rank Aggregation 

The widely accepted objective to minimize in rank aggregation is the Kendall distance, that is, the number of pairwise disagreements between the full order and each of the partial orders computed in each image. Formally, the Kendall distance, $K(\sigma, \sigma_i)$, between the full order, $\sigma$, and a partial order, $\sigma_i$, of the sequence $S_i$, is defined by

$$K(\sigma, \sigma_i) = \sum_{\substack{l, k \in S_i \\ \sigma_i(l) < \sigma_i(k)}} \delta(\sigma(l), \sigma(k)),$$

where $\delta(i, j) = 1$ if $i > j$ and $\delta(i, j) = 0$ otherwise.

The rank aggregation problem is formally defined by minimizing

$$\sigma^* = \arg\min_{\sigma} \sum_{i=1}^{|\mathcal{I}|} K(\sigma, \sigma_i).$$
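To make the objective concrete, the following minimal sketch counts the pairwise disagreements directly. It assumes rankings are stored as dictionaries mapping an item to its rank; this data layout and the function names are illustrative assumptions.

```python
def kendall_distance(full_rank, partial_rank):
    """Number of pairs ordered l-before-k in the partial ranking but
    k-before-l in the full ranking. Only items of partial_rank are compared."""
    items = list(partial_rank)
    disagreements = 0
    for l in items:
        for k in items:
            if partial_rank[l] < partial_rank[k] and full_rank[l] > full_rank[k]:
                disagreements += 1
    return disagreements


def aggregation_cost(full_rank, partial_ranks):
    """Objective of the rank aggregation problem: sum of Kendall distances."""
    return sum(kendall_distance(full_rank, r) for r in partial_ranks)
```

For example, with the full order a, b, c, d and a per-image order a, c, b, the only disagreeing pair is (c, b), so the distance is 1.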

Since minimizing this objective was proven to be NP-hard, we use the Markov chain approximation to this problem [12, 13]. In our method we employ two rank aggregation instances, for the x and y coordinates independently. We next briefly describe the Markov chain approximation for rank aggregation for the x coordinate.

Let $G = (V, E)$ be a weighted and directed graph. A node $v_i \in V$ corresponds to the bin $B_i$. The weight, $w(e)$, of a directed edge, $e = (v_i, v_j) \in E$, corresponds to the vote that $\sigma_x(B_i) < \sigma_x(B_j)$. It is computed based on the spatial distances of the image points $p_i^k \in B_i$ and $p_j^k \in B_j$ in the image $I_k \in \mathcal{I}$. That is,

$$w(e) = \sum_{I_k \in \mathcal{I}} \left| x(p_j^k) - x(p_i^k) \right|,$$

where $x(p)$ is the x coordinate of $p$ and the sum is taken over the images in which $x(p_i^k) < x(p_j^k)$. To resolve conflicts in the order between $B_i$ and $B_j$, e.g., due to correspondence errors, the directed edge, $e_{ij} = (v_i, v_j)$ or $e_{ji} = (v_j, v_i)$, with the smaller weight is discarded; that is, if $w(e_{ij}) > w(e_{ji})$ then we keep $e_{ij}$.

Figure 1. The user interface for rotating the camera, $C$, to center a point, $p$, in the intermediate view, $I^k$. In (a) and (b), $p$ is out of the FOV, so an arrow indicates its direction. In (c) and (d), $p$ is within the FOV and marked by a red circle. In (d), $p$ is in the center region of $I^k$. The images were taken from dataset urban2.

Figure 2. Intervals defined by the SOFA representation. The intervals of the left image are [17,356] and [118,969] in $\sigma_x$ and $\sigma_y$, respectively. The intervals of the right image are [46,399] and [91,961] in $\sigma_x$ and $\sigma_y$, respectively. The overlap between the two images is defined as the overlap of their intervals, that is, [46,356] and [118,961] in $\sigma_x$ and $\sigma_y$, respectively.

The Markov chain approximation is defined by a graph of states and a transition matrix, $M$, that corresponds to the probability of transition from one state to the other. The idea is that a sufficient number of steps in a random walk over the graph will end up in the state that corresponds to the element to be ranked last. If this state is removed, the process may be repeated until all elements are ranked. In our method, the set of states is defined to be $V$ and $M_{ij} = w(e)/n_e$, where $e = (v_i, v_j)$ and $n_e$ is a normalization constant so that the sum of each row in $M$ is exactly 1. Assuming a uniform probability distribution over the $|V|$ states, defined by a vector $x$, the probability distribution after a random walk of $k$ steps is $M^k x$. The random walk eventually converges to the eigenvector $y = My$. In our method we obtain a good approximation of $y$ by running a few power iterations until a steady state is reached. The state with the highest probability from $y$ is removed and the process is repeated until the full spatial ordering, $\sigma_x$, is obtained.
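The remove-the-last-state procedure can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' Matlab code: edge weights are given as a dictionary keyed by `(u, v)` pairs meaning "u precedes v", a state without outgoing edges is given a self-loop so that each row sums to 1, the distribution is propagated as a row vector, and a fixed number of power iterations replaces a convergence test.

```python
def stationary(weights, states, iters=100):
    """Approximate the limiting distribution of the random walk by power iteration."""
    n = len(states)
    index = {s: i for i, s in enumerate(states)}
    M = [[0.0] * n for _ in range(n)]
    for (u, v), w in weights.items():
        if u in index and v in index:
            M[index[u]][index[v]] = float(w)
    for i in range(n):
        total = sum(M[i])
        if total == 0.0:
            M[i][i] = 1.0  # assumption: a state with no outgoing edge keeps its mass
        else:
            M[i] = [w / total for w in M[i]]
    x = [1.0 / n] * n  # uniform initial distribution
    for _ in range(iters):
        x = [sum(x[i] * M[i][j] for i in range(n)) for j in range(n)]
    return x


def markov_rank(weights, states):
    """Repeatedly remove the most probable state (ranked last) to build the order."""
    remaining = list(states)
    reversed_order = []
    while remaining:
        probs = stationary(weights, remaining)
        last = remaining[max(range(len(remaining)), key=probs.__getitem__)]
        reversed_order.append(last)
        remaining.remove(last)
    return reversed_order[::-1]
```

Each round rebuilds the transition matrix over the remaining states only, which mirrors the description of removing the last-ranked state and repeating the walk.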

The expected limitation of using the dictionary for computing feature correspondence is false positive matching, which may affect the rank aggregation results. To reduce the number of false positives, a large dictionary is used (see Sec. 4). The other source of errors is that the order of the scene points is not preserved in the set of images, since the scene consists of objects with different depths, for example a tree or a pole. In practice the average Kendall distance between the global rank and the local rank in each image on the sets we considered is small (~5%). Moreover, for the application at hand, our method is able to deal with these errors successfully, as we show in Sec. 5.

3.3.3 Sequence Construction 

The SOFA representation is used for determining the rotation direction and in particular the supporting set of images. We compute an interval in the global ranking for each image. The overlap between two images is defined by the overlap between two such intervals (see Fig. 2). To overcome ranking errors, the interval of each image is computed using the medians of the first and the last deciles.
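A minimal sketch of the interval computation and the overlap test, assuming each image is described by the list of global ranks of the bins detected in it. The helper names and the handling of images with fewer than ten features are assumptions made for illustration.

```python
from statistics import median


def rank_interval(ranks):
    """Robust [low, high] interval: medians of the lowest and highest deciles."""
    ordered = sorted(ranks)
    size = max(1, len(ordered) // 10)  # assumption: at least one rank per decile
    return median(ordered[:size]), median(ordered[-size:])


def interval_overlap(a, b):
    """Overlap of two closed intervals, or None if they are disjoint."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None
```

Applied to the x intervals quoted in the caption of Fig. 2, `interval_overlap((17, 356), (46, 399))` gives `(46, 356)`.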

We will now show how to choose a scene point $P$, which is represented by a bin $B$, so that its projection is centered in the next intermediate image. Let $S_x$ and $S_y$ be the x and y rankings of the dictionary bins $\{B_i\}$, computed using rank aggregation as described above. The bins $B_t$ and $B_c$ correspond to projections of the scene points, $P_t$ and $P_c$, that are viewed in the center regions of $I_d$ and $I^k$, respectively. The bin $B_t$, and similarly $B_c$, has a different rank in $S_x$ and in $S_y$. Let $\sigma_x(B_t) = \gamma$, $\sigma_y(B_t) = \gamma'$, $\sigma_x(B_c) = \alpha$ and $\sigma_y(B_c) = \alpha'$. We define for each $B$ for which $\sigma_x(B) \in [\alpha, \gamma]$ and $\sigma_y(B) \in [\alpha', \gamma']$ the value $d(B) = |\sigma_x(B) - \gamma| + |\sigma_y(B) - \gamma'|$. A list $\mathcal{L}$ is then obtained by sorting $d$ in increasing order. We then traverse $\mathcal{L}$ to find a $B$ for which a sufficiently large supporting set of images exists, in which features $p \in B$ are detected. Once such $B$ is found, we apply the basic case solution (Sec. 3.1), where the fundamental matrices of the support set of images with respect to $I^k$ are computed using the BEEM algorithm [17]. It receives as input the initial correspondences using the dictionary (Sec. 3.3.1) and is able to overcome correspondence errors.
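The traversal of the sorted candidate list can be sketched as below. The rankings are assumed to be dictionaries from bin identifier to rank, and `support_size` is a hypothetical callable returning the number of usable supporting images for a bin; both, along with the threshold `min_support`, are illustrative assumptions.

```python
def next_bin_candidates(rank_x, rank_y, target, center):
    """Bins whose ranks lie between the center and target bins on both axes,
    sorted by d(B) = |rank_x(B) - rank_x(target)| + |rank_y(B) - rank_y(target)|."""
    gx, gy = rank_x[target], rank_y[target]
    ax, ay = rank_x[center], rank_y[center]
    lo_x, hi_x = min(ax, gx), max(ax, gx)
    lo_y, hi_y = min(ay, gy), max(ay, gy)
    candidates = [
        b for b in rank_x
        if b != center
        and lo_x <= rank_x[b] <= hi_x
        and lo_y <= rank_y[b] <= hi_y
    ]
    return sorted(candidates,
                  key=lambda b: abs(rank_x[b] - gx) + abs(rank_y[b] - gy))


def choose_next_bin(rank_x, rank_y, target, center, support_size, min_support=2):
    """First candidate (closest to the target) that has enough supporting images."""
    for b in next_bin_candidates(rank_x, rank_y, target, center):
        if support_size(b) >= min_support:
            return b
    return None
```

Using `min` and `max` for the interval bounds removes the "without loss of generality" assumption on the rotation direction; the target bin itself is the first candidate, so a single step is taken whenever it already has support.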

3.4. Visual User Interface 

In this section we describe the visual user interface that assists a photographer in the required rotation of camera $C$. Let $p$ be the projection of a scene point $P$ to the image plane of $I$ captured by $C$. We propose a user interface that is superimposed on top of the live preview of the camera. We wish to notify the photographer of the location of $p$ in the image plane of every frame $f$ in the live preview. When $p$ is within the FOV of $f$, it is marked on the frame (Fig. 1(c)). Once $p$ is marked, it is easy for the photographer to rotate the camera to center it. When $p$ is outside the FOV of $f$, only the direction from the center of $f$ to $p$ is marked (e.g., using an arrow (Fig. 1)). To further assist the photographer, the epipolar lines are marked on $f$ so the photographer knows that $p$ is at the intersection of their extensions.

Figure 3. Camera guidance examples. In the top row - dataset park1; in the middle row - dataset park2; in the bottom row - dataset urban1. Note that the green lines correspond to inlier epipolar lines, while the red lines correspond to outliers (see Sec. 3.1).

While the user rotates $C$, the marked location of $p$ (or the directing arrow or the set of epipolar lines) is updated. For the first frame, the location of $p$ as well as the pair of epipolar lines are computed using the supporting set of images (see Sec. 3.1). Since the user is asked to rotate the camera, we can assume that two frames of the live preview, $f_i$ and $f_j$, are related by a homography transformation, $H_{ij}$. Homography transformations can be composed; hence, the homography, $H_{0j}$, between any frame, $f_j$, and $f_0 = I$, can be computed. The updated location of $p$ in frame $f_j$ is given by $\tilde{p}_j = H_{0j} \tilde{p}$, and the updated equation of an epipolar line $l$ is given by $l_j = H_{0j}^{-T} l$. The homography $H_{ij}$ can be computed using RANSAC on a set of corresponding points in the two frames. For implementation details see Sec. 4.
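The per-frame update can be sketched with plain 3x3 matrix arithmetic. The sketch assumes the frame-to-frame homographies have already been estimated (e.g., by a RANSAC-based routine) and are given as nested lists; all function names are illustrative.

```python
def matmul(A, B):
    """Product of two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return tuple(sum(A[i][k] * v[k] for k in range(3)) for i in range(3))


def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]


def inverse(A):
    """Inverse of a 3x3 matrix via its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = A
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[x / det for x in row] for row in adj]


def compose(H_prev, H_step):
    """H_0j from H_0i (H_prev) and H_ij (H_step): apply H_prev first."""
    return matmul(H_step, H_prev)


def update_point(H, p):
    """Location of pixel p = (x, y) after applying homography H."""
    x, y, w = matvec(H, (p[0], p[1], 1.0))
    return (x / w, y / w)


def update_line(H, l):
    """Homogeneous line l after applying homography H: l' = H^{-T} l."""
    return matvec(transpose(inverse(H)), l)
```

Transforming lines with the inverse transpose keeps incidence intact: a point on `l` maps to a point on `update_line(H, l)`.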

Our method can also be used when the camera is automatically controlled (e.g., a robot or a PTZ camera). In this case the interface is much simpler since the rotation can be directly conveyed using an axis and angle. Given the internal calibration matrix, $K$, that corresponds to image $I$, the rotation axis, $a$, and angle, $\theta$, for centering $p$ are given by

$$\theta = \arccos\left(d(o)^T d(p)\right),$$

$$a = d(o) \times d(p),$$

where $d(p) = K^{-1}\tilde{p} / \|K^{-1}\tilde{p}\|$ is the normalized direction that corresponds to $p$, and $o$ is the center of the image.
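A small sketch of the axis and angle computation follows. It assumes a calibration matrix with zero skew, so that the inverse can be applied in closed form; the function names and the example intrinsics are hypothetical.

```python
import math


def direction(K, p):
    """Unit viewing direction K^{-1} p~ for pixel p, assuming zero skew in K."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    v = ((p[0] - cx) / fx, (p[1] - cy) / fy, 1.0)
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def centering_rotation(K, p, center):
    """Axis (unnormalized cross product) and angle that bring p to the image center."""
    d0, dp = direction(K, center), direction(K, p)
    cos_theta = sum(a * b for a, b in zip(d0, dp))
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding
    axis = (d0[1] * dp[2] - d0[2] * dp[1],
            d0[2] * dp[0] - d0[0] * dp[2],
            d0[0] * dp[1] - d0[1] * dp[0])
    return axis, theta
```

With a focal length of 1000 pixels and a point 1000 pixels to the right of the center, the sketch returns an angle of 45 degrees about the vertical image axis; the axis has length $\sin\theta$ and can be normalized before it is sent to a pan-tilt controller.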

4. Implementation Details 

Our method is run on a client-server configuration. The method without the interface (Sec. 3.4) was implemented in Matlab and run on a laptop (Toshiba Portege Z830-10D) as the server. The user images ($I^k$) were captured by a smartphone (client) and transferred via WiFi to the laptop. The interface was implemented for Android smartphones using OpenCV4Android [1]. SIFT [19] and SURF [5] are too slow when computed on the smartphone for real time computation of the homographies that are used by the interface. Instead we use the Fast Retina Keypoint (FREAK) [2], which provides real-time performance.

Parameter setting: The depth of recursion in the hierarchical K-means was set to $d = 3$. Each layer consists of $K = \sqrt[d]{k}$ clusters, where $k = t_k w$ is the total number of clusters, $t_k$ is a clustering parameter, and $w$ is the number of SIFT features in all the images. The performance of rank aggregation degrades significantly when the number of clusters in the dictionary, $k$, is “too small”, i.e., the feature correspondence precision is low. Since the feature correspondence recall is not as important, we sacrifice it for high precision by setting $t_k$ high (typically $t_k = 0.9$). In our experiments we restricted the number of supporting images, $|\mathcal{I}^s|$, to a maximum of 4, which was shown to be sufficient for computing the intersection point.

5. Experimental Results 

To test our method, we assembled photos from six scenes: three scenes that were captured in an urban environment, urban1, urban2 and urban3, and three in public parks, park1, park2 and park3, each containing hundreds of images, captured by three different cameras (Samsung Galaxy S4, Samsung Galaxy Note, and Apple iPhone 4). The size of all images is 1280 × 720.¹

Experiment 1 (our datasets): We tested our method on each of the datasets with 12 different pairs of initial and destination images. Examples of typical tests are shown in Fig. 3, Fig. 4 and in the figure on the first page, where the leftmost and rightmost images correspond to $I^0$ and $I_d$, respectively. In all of the examples the camera $C$ needs to be rotated to the right in order to view $P_t$. The images in the middle show the progression of our algorithm, where in each iteration the captured image, $I^k$, is “closer” to $I_d$. Moreover, the visual user interface is superimposed on the frames of $I^0$ and $I^1$. Examples of the support sets are shown in Fig. 4 (bottom row). Additional run examples are presented in the supplemental material.

Figure 4. Camera guidance example with the supporting sets of images from dataset urban3. Top row - intermediate views and destination image; bottom row - (a) and (b) are the support set of $I^0$, and (c)-(e) are the support set of $I^1$.

Dataset   # images   Success rate   # iterations   Cam. guidance time (min)   Dictionary time (min)   SFM time (min)
urban1    173        100%           2.6            3.2                        3.7                     215.9
urban2    197        91.6%          2.42           1.7                        3.8                     289.2
urban3    235        83.3%          2              1.9                        8.7                     433.3
park1     90         91.6%          2.41           1.4                        1.2                     74.8
park2     141        91.6%          1.66           1.1                        3.1                     154.1
park3     206        91.6%          1.75           0.8                        3.2                     277.8

Table 1. Camera guidance method: quantitative results

¹These datasets will be publicly available upon paper acceptance.

A run is declared a success if is projected to the center 
region of Failure occurred in two scenarios. The first is 
when the camera is rotated to capture a new image that has 
no overlapping images in X. The algorithm can detect and 
alert the user about this case. This type of failure occurred 
rarely. The second is when the final image does not 
contain the projection of Pt in its center region. This type 
of failure occured once in our tests. 

The algorithm succeeds even when there is a large gap 
between 7^ and 7^, which requires the user to rotate the 
camera by over 100°. The assumption that the spatial or¬ 
ders of features are preserved in all images does not hold 
in our datasets due to moving objects (e.g., people), poles 
and trees, and feature correspondence errors (e.g., due to 
repeated structures on the large building in Fig. 4). In addi¬ 
tion, the computed fundamental matrices between pairs of 
images were sometimes inaccurate or missing. Our method 
copes with these challenges successfully. When only a sin¬ 
gle image in the support set is available, the user rotates the 
camera to position the epipolar line in the frame. This often 
results in an additional intermediate view. 

The method can also be applied when the point Pt is oc¬ 


cluded. See for example that the marked point (on the poles) 
in the support set of 7^ (Fig. l(a)&(b)) is occluded in 7^ by 
the tree. Hence, a direct computation of correspondence 
would fail in this case. Despite this, the point is success¬ 
fully marked in 7^ by the intersection of the corresponding 
epipolar lines. 

To quantify the results the following measures are used: 
(i) success rate (success/total); (ii) mean number of inter¬ 
mediate views; (iii) time required for each run given the 
dictionary; (iv) time required for dictionary construction. 
These results are summarized in Table 1. The success rate 
of the algorithm is very high, and the average number of 
intermediate views in our experiments is ^2. 

The overall running time of the algorithm consists of the
offline dictionary construction step and the online camera
guidance. The online camera guidance includes the reaction
time of the user as well as the running time of the algorithm
(SIFT features, homography, epipolar geometry and rank
aggregation computation). The online components require
about two minutes to run, which is reasonable as a proof
of concept. We believe that this may be considerably improved
using an optimized configuration, which would result
in a real-time application.

Experiment 2 (photo-tourism datasets): Although our
assumption is that a dataset, X, is captured by photographers
present at the scene close to the time at which the
method is first used, we also tested our method with two
publicly available datasets, Notre Dame and Trevi Fountain
[22]. Since most images in these datasets overlap, they
were vertically cut in the middle to produce more challenging,
non-overlapping images. We tested the first step of our
method on randomly chosen pairs of initial and destination
images. Additional steps require visiting these scenes. The
points chosen by our method appear to lead to rotating the
camera towards Pt. The first iteration results are fully
presented in the supplemental material.

# images   SFM time (min)   Cam. recovery rate   Success rate (SFM)   Success rate (ours)
20         6.1              8/20                 0/5                  1/5
40         18.4             19.6/40              2/5                  3/5
60         41.5             35.8/60              2/5                  3/5
80         78.5             66.6/80              3/5                  4/5
100        129.6            81.2/100             4/5                  4/5
120        186.4            119.6/120            5/5                  5/5

Table 2. SFM results and running time for dataset park3

Comparison with SFM - running time: SFM may be
used to project Pt onto the photographer's image (see Sec. 3).
Then, our user interface may be used to center the projection
of Pt in the newly captured image. However, we wished to avoid
using SFM due to its computational time, which is expected to
be very high. To quantify this claim, we compared the running
time of our method with a state-of-the-art SFM method [25, 26]
based on “Bundler” [22] and parallelized using the GPU.
The running time of SFM for the datasets is presented in
Table 1. As expected, it is much slower than our method.
In all datasets, our method is between 50 and 85 times faster
than SFM, and SFM may take hours to run. It
can also be seen that as the number of images grows, the
running time of SFM grows roughly quadratically.
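
As a rough check of the quadratic trend, one can fit a power
law to the SFM running times listed in Table 2 (subset size
versus minutes). The Python sketch below is an illustration
only and was not part of the reported experiments; the helper
name loglog_slope is hypothetical. It estimates the exponent k
in time ~ n^k as the least-squares slope of log(time) against
log(n).

```python
import math

# (number of images, SFM running time in minutes), copied from Table 2
table2 = [(20, 6.1), (40, 18.4), (60, 41.5),
          (80, 78.5), (100, 129.6), (120, 186.4)]


def loglog_slope(points):
    """Least-squares slope of log(time) against log(n)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


exponent = loglog_slope(table2)
print(f"fitted exponent: {exponent:.2f}")
```

For these six rows the fitted exponent comes out slightly
below 2, which is consistent with roughly quadratic growth over
the tested range of 20 to 120 images.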

Comparison with SFM - # of images: Since the number
of images in the datasets was chosen somewhat arbitrarily,
we tested the performance of SFM and our method with
subsets of images of different sizes, randomly selected from
the parks dataset. Two measures are used to quantify the
performance.

The first is the camera recovery ratio, |C'|/|C|, between
the number of successfully recovered cameras, |C'|, and the
total number of cameras, |C|. As the number of images
grows, this ratio tends to grow. Only at about 120 images is it
close to 1; however, in this case, running SFM takes hours. In
scenes that were not captured beforehand, this duration is
unacceptable. Table 2 presents the running time of the SFM
method. Note that for each subset size (rows of Table 2),
5 instances of random subsets were used, and the values in
Table 2 are their average.
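
The camera recovery ratio can be read directly off Table 2.
The short Python sketch below only restates that arithmetic;
the variable and function names are illustrative, and the
recovered-camera counts are the per-size averages reported in
the table.

```python
# (subset size, mean number of recovered cameras), copied from Table 2
recovered = [(20, 8.0), (40, 19.6), (60, 35.8),
             (80, 66.6), (100, 81.2), (120, 119.6)]


def recovery_ratio(num_recovered, num_cameras):
    """Camera recovery ratio |C'| / |C|."""
    if num_cameras <= 0:
        raise ValueError("the number of cameras must be positive")
    return num_recovered / num_cameras


ratios = {n: recovery_ratio(r, n) for n, r in recovered}
for n, ratio in ratios.items():
    print(f"{n:>3} images: recovery ratio {ratio:.2f}")
```

In these numbers fewer than half of the cameras are recovered
for the two smallest subsets, and the ratio only approaches 1
for the largest subset.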

The second measure is the success rate in five online runs
of the camera guidance method (five runs per subset size).
For each run, we use both the camera guidance with our full
method, and the camera guidance based on directly computing
the SFM (described in Sec. 3). The inputs to both
methods, the initial image and Id, vary from run to run. The
results are comparable, with a slight advantage to our method.
Table 2 presents the results for the second measure. These
results confirm the necessity of a faster alternative to SFM
in the camera guidance problem. It is important to note that
some of the destination images Id that SFM failed to recover
were successfully reached with the guidance of our method,
which makes our method suitable for “filling gaps” in photo
collections used by SFM as input.

Image based panoramas: To demonstrate that images in
our datasets are generally not related by homography
transformations, we used 10 pairs of overlapping images (from
all datasets) that are separated by translation as input to a
RANSAC procedure to estimate homography transformations.
In most cases, there was no homography transformation
with more than 4 inliers (“false” homographies). In
these cases, any point besides the 4 inliers is transferred to
a wrong location (usually very far) from the expected
projection location. In two cases, a homography transformation
was found with more than 30 inliers; however, it
corresponds to planar surfaces in the images. In these cases,
any point not on the plane is transferred to a wrong location
as well.
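
The inlier count used in this test can be illustrated with a
small sketch. Four point correspondences in general position
determine a homography exactly, so a RANSAC hypothesis with
only 4 inliers explains nothing beyond its own minimal sample.
The Python code below is an assumption-laden illustration: the
function names, the 3-pixel threshold, and the toy
correspondences are hypothetical and are not taken from the
experiments.

```python
import math


def apply_homography(H, point):
    """Map a 2D point through the 3x3 homography H; None if at infinity."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        return None
    return (u / w, v / w)


def count_inliers(H, matches, threshold=3.0):
    """Count matches (src, dst) whose transfer error is within threshold."""
    inliers = 0
    for src, dst in matches:
        mapped = apply_homography(H, src)
        if mapped is None:
            continue
        error = math.hypot(mapped[0] - dst[0], mapped[1] - dst[1])
        if error <= threshold:
            inliers += 1
    return inliers


# Toy data (assumed): four points on one plane shift by 10 pixels, while
# two points at other depths shift by 40 and 45 pixels due to parallax.
H_plane = [[1, 0, 10], [0, 1, 0], [0, 0, 1]]
on_plane = [((0, 0), (10, 0)), ((100, 0), (110, 0)),
            ((0, 100), (10, 100)), ((100, 100), (110, 100))]
off_plane = [((50, 50), (90, 50)), ((20, 80), (65, 80))]
num_inliers = count_inliers(H_plane, on_plane + off_plane)
print(num_inliers)
```

In this toy case the homography fits only the four coplanar
points, and the two off-plane points are transferred tens of
pixels away from their true positions, mirroring the behavior
reported above for images separated by translation.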

6. Conclusions and Future Work 

We propose a new problem, called the camera guidance
problem, whereby we wish to instruct a user to rotate his
camera and capture an image such that it has overlapping
FOV with a destination image. Our solution consists of two
components. The first is to define the rotation and the
second is to convey it to the user.

Although SFM methods can be used to solve the camera
guidance problem, we have shown that our method is much
faster, with comparable performance. This is especially true
in the common case, where dozens of images are required
to obtain reasonable models of the scene and the cameras.
We introduce the alternative SOFA scene representation and
show that it can be efficiently computed using a rank
aggregation approximation algorithm. It remains to be seen how
the SOFA representation can be used for other tasks that
require only the recovery of rough scene geometry.

While it has been shown to be effective, our method is 
limited to specific settings, where the viewers are positioned 
on one side of the scene, e.g., a crowd in front of a stage. 
In addition, our method has been shown to tolerate a small 
number of moving objects and it is based on static regions. 
It would be interesting to study whether the moving objects 
can also be used for solving the camera guidance problem. 